# Practical Introduction to Model Training

**In this notebook, we will train a spaCy named entity recognition model (NER) using data from [LitBank](https://github.com/dbamman/litbank), an annotated dataset of 100 works of English-language fiction.**

Steps:  
✅ Load annotation data from LitBank  
✅ Create train and validation sets  
✅ Train NER from scratch using only the EN language object  
✅ Visualize the results and compare the model's predictions against the original data  
✅ Is the model sufficiently useful for research? What would need to be improved and changed?  

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ndN5qqGF-ICayAeZBKEGp7Qqi2-bTvu6?usp=sharing)

## Installing dependencies & loading data
First, we install spaCy (to train the model), scikit-learn (to split the data into training, validation, and test sets), and tqdm (for a nice progress bar).

We also clone the GitHub repo with the LitBank data.

In [None]:
#Install libraries (the sklearn module is published on PyPI as scikit-learn)
!pip install spacy scikit-learn tqdm
#Clone LitBank
!git clone https://github.com/dbamman/litbank.git
import spacy
#Show what version of spaCy we're using
print(f'Using spaCy version {spacy.__version__}')

Next, we create a list of the text files in the `litbank/entities/brat` directory and display the number of texts.

In [2]:
#Import Path for working with filesystem paths
from pathlib import Path
#Build the path to litbank/entities/brat inside the cloned repo
entities_path = Path.cwd() / 'litbank' / 'entities' / 'brat'
#Create a list of the text files in that directory
text_files = [f for f in entities_path.iterdir() if f.suffix == '.txt']
#Check that all 100 LitBank texts were found
assert len(text_files) == 100
#Show how many text files have been imported
print(f'[*] imported {len(text_files)} files')

[*] imported 100 files


## Process LitBank data
Here, we run each of the LitBank text files through spaCy, using only the sentencizer rather than the rest of the default English pipeline, because we want to train a new model, not reuse an existing model's predictions. We also extract the annotations from the `.ann` file that accompanies each text (these should refer to people, places, etc.) and add them to an entity list for that text.

In [None]:
# for each file, create a Doc object and add the annotation data to doc.ents
# our output is a list of Doc objects 
#Import spaCy, tqdm, and various utilities from spaCy
import spacy 
from tqdm.notebook import tqdm
from spacy.tokens import Span, DocBin
from spacy.util import filter_spans

#Creates a list of Doc objects that are the output from spaCy
docs = []

#Use a blank English pipeline (tokenizer only, no trained components)
nlp = spacy.blank("en")
#Add the sentencizer to break the text up into sentences
nlp.add_pipe('sentencizer') # used in training assessment

#For each text file, while showing a progress bar
for text_file in tqdm(text_files):
    #Read the file and run it through the blank pipeline
    doc = nlp(text_file.read_text())
    #Locate the annotation (.ann) file that matches this text file
    annotation_file = entities_path / (text_file.stem + '.ann')
    #Split the annotations into lines
    annotations = annotation_file.read_text().splitlines()
    #Create a list for the entities
    ents = []
    #For each annotation, skipping blank lines
    for annotation in annotations:
        if not annotation.strip():
            continue
        #The second tab-separated field holds the label, start offset, and end offset
        label, start, end = annotation.split('\t')[1].split()
        #Span is the text in the doc corresponding to the annotation
        span = doc.char_span(int(start), int(end), label=label)
        #Keep the span only if it is valid
        if span: # when start and end do not align with token boundaries, char_span returns None
            ents.append(span)
    #Remove duplicate or overlapping spans, keeping the longest ones
    filtered = filter_spans(ents)
    #The entities we want are the filtered list
    doc.ents = filtered
    #Append the spaCy-analyzed text to the list of docs
    docs.append(doc)
    

assert len(docs) == 100
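
To make the annotation format more concrete, here is a minimal sketch of parsing a single brat-style line in plain Python. It assumes the layout that the loop above relies on: an ID, then a tab, then `label start end`, then a tab, then the annotated text. The `BratEntity` class, the `parse_brat_line` helper, and the sample line are invented for illustration and are not part of LitBank or spaCy.

```python
from typing import NamedTuple, Optional


class BratEntity(NamedTuple):
    """One text-bound annotation from a brat .ann file (hypothetical helper type)."""
    ann_id: str
    label: str
    start: int
    end: int
    text: str


def parse_brat_line(line: str) -> Optional[BratEntity]:
    """Parse a line such as 'T1<TAB>PER 0 14<TAB>Emma Woodhouse'.

    Returns None for blank lines, non-entity lines, and discontinuous spans.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3 or not fields[0].startswith("T"):
        return None
    parts = fields[1].split()
    if len(parts) != 3:  # e.g. a discontinuous span like 'PER 0 4;10 14'
        return None
    label, start, end = parts
    return BratEntity(fields[0], label, int(start), int(end), fields[2])


# Made-up example line, not taken from the LitBank data
sample = "T1\tPER 0 14\tEmma Woodhouse"
print(parse_brat_line(sample))
```

Returning `None` for anything unexpected mirrors the `if span:` check above: lines that cannot be used are skipped rather than stopping the whole run.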

## Split data into sets for training and validation
We don't want to use all the data for training, because that would leave us without any data to use for checking the model's accuracy. The *training* data is what the model actually learns from; the *validation* data is used to measure progress during training and to choose the best model; the *test* data is held back until the end, so that its human annotations can serve as the "gold standard" of "right" answers when we evaluate the final model.

If you read general-purpose descriptions of the different data sets for model training, you may see references to *hyperparameters* (like the "learning rate"). spaCy's built-in model training provides sensible defaults that you don't necessarily need to modify, but if you're interested in the details of what *could* be modified, you can check the [documentation about the training config file](https://spacy.io/usage/training#config).

In [5]:
# Split the data into sets for training and validation 
from sklearn.model_selection import train_test_split

#Split the data into the training set (90%) and validation set (10%)
train_set, validation_set = train_test_split(docs, test_size=0.1)
#Split the validation set into the actual validation set (70%) and test set (30%)
validation_set, test_set = train_test_split(validation_set, test_size=0.3)
#Print how many docs are in each set
print(f'🚂 Created {len(train_set)} training docs')
print(f'😊 Created {len(validation_set)} validation docs')
print(f'🧪 Created {len(test_set)} test docs')

🚂 Created 90 training docs
😊 Created 7 validation docs
🧪 Created 3 test docs
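
The arithmetic behind the 90/7/3 result can be sketched in plain Python. This is an illustration of the idea, not scikit-learn's implementation: `three_way_split` and its parameters are hypothetical names, and the fixed `seed` is an addition so that the sketch gives the same split on every run. The notebook's `train_test_split` calls do not set `random_state`, so their splits differ from run to run.

```python
import random


def three_way_split(items, holdout=0.1, test_share=0.3, seed=0):
    """Shuffle items, then split them into (train, validation, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # First cut: hold out a fraction of everything (10 of 100 docs)
    n_holdout = round(len(shuffled) * holdout)
    # Second cut: a share of the held-out part becomes the test set (3 of 10)
    n_test = round(n_holdout * test_share)
    held_out = shuffled[:n_holdout]
    train = shuffled[n_holdout:]
    test = held_out[:n_test]
    validation = held_out[n_test:]
    return train, validation, test


train, validation, test = three_way_split(range(100))
print(len(train), len(validation), len(test))  # 90 7 3
```

With only 3 test documents and 7 validation documents, scores on either set can shift noticeably depending on which texts happen to land there.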


### Save the data sets
From here, we save the training, validation, and test data sets.

In [6]:
#Import DocBin, a format for saving a collection of spaCy Doc objects
from spacy.tokens import DocBin

#Define a DocBin for training data
train_db = DocBin()
#For each doc in the training set
for doc in train_set:
    #Add it to the training DocBin
    train_db.add(doc)
#Save the resulting file
train_db.to_disk("./train.spacy")

# Define a DocBin for validation data, and do the same as above
validation_db = DocBin()
for doc in validation_set:
    validation_db.add(doc)
validation_db.to_disk("./dev.spacy") 

# Define a DocBin for test data, and do the same as above
test_db = DocBin()
for doc in test_set:
    test_db.add(doc)   
test_db.to_disk("./test.spacy") 
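
As a quick sanity check on the format, the sketch below writes one small `Doc` to a `DocBin` file and reads it back. The sentence, its entity offsets, and the temporary file name are made up for illustration; the same `from_disk` and `get_docs` calls should work on `train.spacy`, `dev.spacy`, and `test.spacy`.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
# A made-up sentence with two hand-written entity spans
doc = nlp("Emma Woodhouse lived in Highbury.")
doc.ents = [
    doc.char_span(0, 14, label="PER"),
    doc.char_span(24, 32, label="LOC"),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sample.spacy"
    # Save a DocBin holding the single doc
    DocBin(docs=[doc]).to_disk(path)
    # Reading back needs a vocab to rebuild the Doc objects
    restored = list(DocBin().from_disk(path).get_docs(nlp.vocab))

restored_ents = [(ent.text, ent.label_) for ent in restored[0].ents]
print(restored_ents)
```

If the printed entities match what was written, the annotations survived the round trip, which is what the training command depends on.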

Here, we check that the files all exist and that their sizes are plausible given the way we split the data (90% training, with the remaining 10% divided into 70% validation and 30% test).

In [7]:
!ls -al train.spacy dev.spacy test.spacy

-rw-r--r-- 1 root root  115753 Dec 23 08:20 dev.spacy
-rw-r--r-- 1 root root   53751 Dec 23 08:20 test.spacy
-rw-r--r-- 1 root root 1406959 Dec 23 08:20 train.spacy


## Create training configuration file
Here, we create the configuration file we'll need to actually run the training. We're using the English language, the named entity recognition (NER) pipeline, and otherwise just the defaults. The `-F` flag overwrites `config.cfg` if it already exists.

In [9]:
!python3 -m spacy init config ./config.cfg --lang en --pipeline ner -F

⚠ To generate a more effective transformer-based config (GPU-only),
install the spacy-transformers package and re-run this command. The config
generated now does not use transformers.
ℹ Generated config template specific for your use case
- Language: en
- Pipeline: ner
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy


## Model training
The following command starts the training. The trained pipeline is saved to a directory called `output`, and we pass in the paths to the training data (`train.spacy`) and the validation data (`dev.spacy`).

In [11]:
!python3 -m spacy train config.cfg --output ./output --paths.train train.spacy --paths.dev dev.spacy

ℹ Saving to output directory: output
ℹ Using CPU
[2021-12-23 08:22:05,786] [INFO] Set up nlp object from config
[2021-12-23 08:22:05,792] [INFO] Pipeline: ['tok2vec', 'ner']
[2021-12-23 08:22:05,794] [INFO] Created vocabulary
[2021-12-23 08:22:05,795] [INFO] Finished initializing nlp object
[2021-12-23 08:22:11,376] [INFO] Initialized pipeline components: ['tok2vec', 'ner']
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
  0       0          0.00   1072.88    0.00    0.00    0.00    0.00
  2     200      18889.69  63358.78   35.62   28.97   46.21    0.36
  4     400      11414.18  27850.79   53.58   60.92   47.83    0.54
  6     600      24399.51  23338.55   54.53   64.79   47.08    0.55
  8     800      20359.37  18970.20   57.03   62

## Test the new model
Finally, we can check how the model we just trained performs, using the test data set for comparison. The closer the model's results are to the human-annotated test set, the better the model is performing. We'll start by running the model on a random excerpt from the test set.

In [22]:
#Import the random library to choose a random excerpt
import random
#displaCy shows a nice visualization of spaCy data, including entities on text
from spacy import displacy 

#Load the model we just trained
new_nlp = spacy.load("output/model-last")
#Pick a random document from the test data set
val_doc = random.choice(test_set)
#Run the new model on the text of that document
doc = new_nlp(val_doc.text)

#Show the first 100 tokens of the random document
displacy.render(doc[:100], jupyter=True, style="ent")

To compare, let's display the original, human-generated annotations.

In [23]:
# Display the original annotations in the same style
displacy.render(val_doc[:100], jupyter=True, style="ent")

It's not always easy to see the differences right away: walk through the human-annotated text, entity by entity, and then check what happened with the model at that same point in the text. Some common errors include getting the entity right but the label wrong (e.g. switching LOC/PER), and including too many words in the entity, in addition to just missing the entity entirely.
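
The entity-by-entity comparison can also be expressed as a small calculation. The sketch below assumes exact-match scoring, where a predicted entity counts as correct only if its start, its end, and its label all agree with a human annotation. The `score_entities` function and the offsets in the example are hypothetical and chosen to show the three error types described above.

```python
def score_entities(gold, predicted):
    """Exact-match precision, recall, and F1 over (start, end, label) tuples."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented offsets and labels, for illustration only
gold = [(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "PER"), (14, 15, "FAC")]
predicted = [
    (0, 2, "PER"),   # correct
    (5, 6, "PER"),   # right span, wrong label
    (9, 12, "PER"),  # one word too many
]                    # (14, 15, "FAC") is missed entirely

precision, recall, f1 = score_entities(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# precision=0.33 recall=0.25 f1=0.29
```

With the notebook's objects, tuples like these could be built from `ent.start_char`, `ent.end_char`, and `ent.label_` for each entity in `val_doc.ents` and `doc.ents`.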

## Evaluation
Is the model sufficiently useful for research? What would need to be improved and changed?

In [None]:
!python -m spacy evaluate output/model-last test.spacy --output litbank.json